
kvstreamer: remove temporary allocations for []Result #82387

Merged · 2 commits · Jun 30, 2022

Conversation

@yuzefovich (Member) commented Jun 3, 2022

kvstreamer: refactor the loop of processing the batch response

This commit refactors the code that processes the batch response in
order to separate out two different concerns:

  • processing non-empty responses in order to create `Result`s
  • processing incomplete responses to populate the resume request.

Now each of these "concerns" is handled in a separate loop, making the
code easier to reason about, especially so in the following commit.

This commit also extracts the multiple return arguments of a function
into a struct, updates some of the comments, and moves some of the
error-checking to an earlier stage.

Release note: None
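
To make the two-loop structure concrete, here is a minimal, self-contained Go sketch of the shape described above; all types and names (response, processBatchResponse, and so on) are invented stand-ins, not the actual kvstreamer code:

```
package main

import "fmt"

// Invented minimal types standing in for the real kvstreamer/kvpb ones.
type response struct {
	rows   []string // payload, if any
	resume string   // non-empty if the response is incomplete
}

type result struct{ rows []string }

// processBatchResponse mirrors the two-loop shape from the commit message:
// the first pass creates results from non-empty responses, and the second
// pass gathers resume spans from incomplete responses for the resume request.
func processBatchResponse(resps []response) (results []result, resumeSpans []string) {
	for _, r := range resps {
		if len(r.rows) > 0 {
			results = append(results, result{rows: r.rows})
		}
	}
	for _, r := range resps {
		if r.resume != "" {
			resumeSpans = append(resumeSpans, r.resume)
		}
	}
	return results, resumeSpans
}

func main() {
	resps := []response{{rows: []string{"a"}}, {resume: "[b, c)"}}
	results, resumeSpans := processBatchResponse(resps)
	fmt.Println(len(results), resumeSpans) // 1 [[b, c)]
}
```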

kvstreamer: remove temporary allocations for []Result

Previously, the worker goroutine would accumulate all `Result`s it
could create based on the KV response into a slice, and then the
slice would be passed into the results buffer. At that point, the
slice would be discarded, since the results buffer would copy all
`Result`s into its internal state.

This commit refactors the streamer as well as the results buffer to
avoid this temporary allocation of `[]Result`. The main idea is that
`Result`s are now passed one by one into the results buffer. The worker
goroutine now acquires the results buffer's mutex, processes the KV
responses one at a time, and whenever a `Result` is created, it is added
to the results buffer right away. However, in order to prevent the
results buffer from eagerly returning a single `Result` on a `GetResults`
call, the streamer's user goroutine is not woken up until the newly
introduced `doneAddingLocked` method is called by the worker goroutine.
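
As an illustration of the protocol just described, here is a toy, self-contained Go sketch; the real resultsBuffer differs in names and details, so treat this as an assumption-laden model rather than the implementation. The key point is that addLocked never signals, so the consumer observes whole batches instead of single results:

```
package main

import (
	"fmt"
	"sync"
)

// Toy model of the add/doneAdding protocol; names are invented and do not
// match the real kvstreamer resultsBuffer.
type buffer struct {
	mu      sync.Mutex
	results []string
	sig     chan struct{} // buffered, capacity 1: "results may be available"
}

func newBuffer() *buffer { return &buffer{sig: make(chan struct{}, 1)} }

// addLocked buffers one result; the caller must hold mu. It does NOT signal
// the consumer, so a GetResults-style reader keeps sleeping.
func (b *buffer) addLocked(r string) { b.results = append(b.results, r) }

// doneAddingLocked is called once per batch of adds, still under mu, and
// only then wakes the consumer, so results are handed over in batches.
func (b *buffer) doneAddingLocked() {
	select {
	case b.sig <- struct{}{}:
	default:
	}
}

// get blocks until a batch has been finalized, then drains the buffer.
func (b *buffer) get() []string {
	<-b.sig
	b.mu.Lock()
	defer b.mu.Unlock()
	r := b.results
	b.results = nil
	return r
}

func main() {
	b := newBuffer()
	go func() {
		b.mu.Lock()
		b.addLocked("r1")
		b.addLocked("r2")
		b.doneAddingLocked() // the consumer wakes only now
		b.mu.Unlock()
	}()
	fmt.Println(b.get()) // [r1 r2]
}
```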

Some care needs to be taken to prevent deadlocks with all of the
mutexes. Since we're now finalizing the results one at a time, we might
need to hold the streamer's mutex (so that we can set `scanComplete`
correctly), and that mutex must be acquired before the results buffer's.
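
A minimal sketch of that lock-ordering rule, with hypothetical names (the real streamer carries far more state):

```
package main

import "sync"

// Sketch of the lock order described above: the streamer's mutex must be
// acquired before the results buffer's mutex, never the other way around.
type streamer struct {
	mu struct {
		sync.Mutex
		scanComplete bool
	}
}

type resultsBuffer struct{ mu sync.Mutex }

func (s *streamer) finalize(b *resultsBuffer, complete bool) {
	s.mu.Lock() // streamer mutex first, so scanComplete can be set...
	defer s.mu.Unlock()
	s.mu.scanComplete = complete
	b.mu.Lock() // ...then the buffer's; the reverse order can deadlock.
	defer b.mu.Unlock()
	// addLocked / doneAddingLocked would run here.
}

func main() {
	(&streamer{}).finalize(&resultsBuffer{}, true)
}
```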

This change shows a modest improvement on the microbenchmarks but
matters a lot more for analytical, TPCH-like queries, where this
`[]Result` is one of the largest sources of garbage.

```
name                                                    old time/op    new time/op    delta
IndexJoin/Cockroach-24                                    5.98ms ± 1%    5.95ms ± 1%    ~     (p=0.079 n=9+10)
IndexJoin/MultinodeCockroach-24                           7.55ms ± 1%    7.59ms ± 1%  +0.47%  (p=0.015 n=8+9)
IndexJoinColumnFamilies/Cockroach-24                      8.68ms ± 3%    8.56ms ± 2%    ~     (p=0.133 n=9+10)
IndexJoinColumnFamilies/MultinodeCockroach-24             11.8ms ± 5%    11.7ms ± 3%    ~     (p=0.315 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24             6.67ms ± 1%    6.69ms ± 1%    ~     (p=0.315 n=10+9)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24    7.87ms ± 1%    7.92ms ± 1%  +0.73%  (p=0.015 n=10+10)
LookupJoinEqColsAreKeyOrdering/Cockroach-24               9.30ms ± 2%    9.31ms ± 4%    ~     (p=0.796 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24      10.9ms ± 4%    10.9ms ± 2%    ~     (p=0.971 n=10+10)
LookupJoinNoOrdering/Cockroach-24                         8.99ms ± 1%    9.03ms ± 4%    ~     (p=0.549 n=9+10)
LookupJoinNoOrdering/MultinodeCockroach-24                12.1ms ± 4%    11.9ms ± 6%    ~     (p=0.143 n=10+10)
LookupJoinOrdering/Cockroach-24                           10.9ms ± 3%    10.8ms ± 3%    ~     (p=0.243 n=10+9)
LookupJoinOrdering/MultinodeCockroach-24                  14.2ms ± 5%    13.9ms ± 3%    ~     (p=0.113 n=10+9)

name                                                    old alloc/op   new alloc/op   delta
IndexJoin/Cockroach-24                                    1.36MB ± 1%    1.31MB ± 0%  -3.61%  (p=0.000 n=10+9)
IndexJoin/MultinodeCockroach-24                           2.07MB ± 2%    2.04MB ± 3%    ~     (p=0.063 n=10+10)
IndexJoinColumnFamilies/Cockroach-24                      1.43MB ± 1%    1.38MB ± 0%  -3.56%  (p=0.000 n=9+9)
IndexJoinColumnFamilies/MultinodeCockroach-24             2.27MB ± 1%    2.22MB ± 2%  -2.09%  (p=0.000 n=8+10)
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24             1.71MB ± 0%    1.67MB ± 0%  -2.70%  (p=0.000 n=9+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24    2.43MB ± 5%    2.35MB ± 1%  -3.31%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/Cockroach-24               1.72MB ± 1%    1.62MB ± 1%  -6.20%  (p=0.000 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24      2.39MB ± 2%    2.30MB ± 3%  -3.53%  (p=0.000 n=10+10)
LookupJoinNoOrdering/Cockroach-24                         1.79MB ± 1%    1.74MB ± 1%  -2.80%  (p=0.000 n=10+9)
LookupJoinNoOrdering/MultinodeCockroach-24                2.35MB ± 3%    2.32MB ± 2%    ~     (p=0.079 n=10+9)
LookupJoinOrdering/Cockroach-24                           1.63MB ± 1%    1.53MB ± 1%  -5.77%  (p=0.000 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                  2.30MB ± 4%    2.23MB ± 2%  -3.41%  (p=0.002 n=9+8)

name                                                    old allocs/op  new allocs/op  delta
IndexJoin/Cockroach-24                                     7.15k ± 1%     7.16k ± 1%    ~     (p=0.888 n=10+9)
IndexJoin/MultinodeCockroach-24                            11.9k ± 2%     11.9k ± 2%    ~     (p=0.968 n=10+9)
IndexJoinColumnFamilies/Cockroach-24                       11.9k ± 0%     11.9k ± 0%    ~     (p=0.075 n=9+10)
IndexJoinColumnFamilies/MultinodeCockroach-24              17.6k ± 1%     17.5k ± 1%    ~     (p=0.566 n=10+10)
LookupJoinEqColsAreKeyNoOrdering/Cockroach-24              9.86k ± 1%     9.88k ± 1%    ~     (p=0.150 n=9+10)
LookupJoinEqColsAreKeyNoOrdering/MultinodeCockroach-24     14.1k ± 0%     14.1k ± 1%    ~     (p=0.055 n=8+10)
LookupJoinEqColsAreKeyOrdering/Cockroach-24                12.6k ± 1%     12.5k ± 1%  -0.77%  (p=0.005 n=10+10)
LookupJoinEqColsAreKeyOrdering/MultinodeCockroach-24       17.2k ± 1%     17.0k ± 0%  -0.88%  (p=0.000 n=10+8)
LookupJoinNoOrdering/Cockroach-24                          12.3k ± 1%     12.3k ± 1%    ~     (p=0.929 n=10+10)
LookupJoinNoOrdering/MultinodeCockroach-24                 16.8k ± 1%     16.8k ± 1%    ~     (p=0.968 n=9+10)
LookupJoinOrdering/Cockroach-24                            14.5k ± 1%     14.5k ± 1%    ~     (p=0.271 n=10+10)
LookupJoinOrdering/MultinodeCockroach-24                   19.4k ± 1%     19.3k ± 1%    ~     (p=0.056 n=9+8)
```

Addresses: #82160.

Release note: None

@cockroach-teamcity (Member) commented: This change is Reviewable

@yuzefovich yuzefovich changed the title Streamer optimizations kvstreamer: misc optimizations Jun 3, 2022
@yuzefovich yuzefovich force-pushed the streamer-optimizations branch 16 times, most recently from 2a23775 to 0da4952 Compare June 9, 2022 17:37
@yuzefovich yuzefovich changed the title kvstreamer: misc optimizations kvstreamer: miscellaneous optimizations and memory accounting Jun 9, 2022
@yuzefovich yuzefovich marked this pull request as ready for review June 9, 2022 17:41
@yuzefovich yuzefovich requested review from a team as code owners June 9, 2022 17:41
@rharding6373 (Collaborator) left a comment

Did a first pass. The optimizations look like they're paying off! Nice work.



-- commits line 163 at r13:
Can we avoid the code duplication from reimplementing two sets of heap methods by making a helper that can implement a min heap that works for both?


-- commits line 201 at r14:
Turning this into a LIFO queue is a neat optimization.


pkg/kv/kvclient/kvstreamer/results_buffer.go line 271 at r12 (raw file):

```
func (b *outOfOrderResultsBuffer) doneAddingLocked() {
	b.Mutex.AssertHeld()
	b.signal()
```

How much of a performance hit is it to signal at the end of addLocked every time it's called and remove doneAddingLocked for reduced complexity?

Edit: I see in a later commit doneAddingLocked is also used for budgeting purposes, so getting rid of it may not be optimal anyway.


pkg/kv/kvclient/kvstreamer/results_buffer.go line 271 at r15 (raw file):

```
func (b *outOfOrderResultsBuffer) doneAddingLocked() {
	b.Mutex.AssertHeld()
```

Isn't it still advisable to hold the lock while signaling on the channel?


pkg/kv/kvclient/kvstreamer/results_buffer.go line 297 at r16 (raw file):

```
func (b *outOfOrderResultsBuffer) doneAddingLocked(ctx context.Context) {
	b.accountForOverheadLocked(ctx, int64(cap(b.results))*resultSize)
```

If I understand this correctly, we can go into memory debt doing this. But in the previous commit we reuse results, right? So we also never reduce the capacity of results, which may cause us to stay in debt?


pkg/kv/kvclient/kvstreamer/streamer.go line 1372 at r12 (raw file):

```
				result.memoryTok.toRelease = getResponseSize(get)
				memoryTokensBytes += result.memoryTok.toRelease
				s.results.addLocked(result)
```

I'm a bit concerned about having calls to addLocked deep in here when the lock is acquired in a conditional above, and that the conditions being the same aren't immediately obvious. The assertion that the lock is held in addLocked helps, but this seems complicated. Will think about this more, nothing actionable atm.

@yuzefovich yuzefovich force-pushed the streamer-optimizations branch from 0da4952 to a675884 Compare June 22, 2022 23:05
@yuzefovich yuzefovich requested review from a team as code owners June 22, 2022 23:05
@yuzefovich yuzefovich requested review from a team and removed request for a team June 22, 2022 23:05
@yuzefovich yuzefovich removed request for a team and stevendanna June 22, 2022 23:05
@yuzefovich (Member, Author) left a comment



-- commits line 163 at r13:

Previously, rharding6373 (Rachael Harding) wrote…

Can we avoid the code duplication from reimplementing two sets of heap methods by making a helper that can implement a min heap that works for both?

I don't think we can do this without sacrificing performance (or, more specifically, without an increase in allocations). We could get around this by using generics and then having a single helper, but without generics the helper would have to operate on interface{}, which escapes to the heap (leading to allocations), whereas the current code operates on the concrete structs singleRangeBatch and inOrderBufferedResult. Another way to have a single helper would be to unify these two structs into one, but that seems extremely dirty.

Thus, I think we should either proceed with the duplicated approach or get rid of the commit altogether. Given the perf wins and the reduction in allocations, I think the code duplication is worth it.
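
For reference, a sketch of what the generics-based helper mentioned above could look like; this is purely illustrative and not the kvstreamer code. Instantiating minHeap per concrete element type avoids the interface{} boxing (and thus the heap escapes) while still sharing one implementation:

```
package main

import "fmt"

// Illustrative generic min-heap helper; one instantiation per concrete
// element type, so elements never escape through interface{}.
type minHeap[T any] struct {
	items []T
	less  func(a, b T) bool
}

// push appends x and sifts it up to restore the heap invariant.
func (h *minHeap[T]) push(x T) {
	h.items = append(h.items, x)
	for i := len(h.items) - 1; i > 0; {
		parent := (i - 1) / 2
		if !h.less(h.items[i], h.items[parent]) {
			break
		}
		h.items[i], h.items[parent] = h.items[parent], h.items[i]
		i = parent
	}
}

// pop removes and returns the minimum, sifting the moved tail element down.
func (h *minHeap[T]) pop() T {
	top := h.items[0]
	last := len(h.items) - 1
	h.items[0] = h.items[last]
	h.items = h.items[:last]
	for i := 0; ; {
		l, r := 2*i+1, 2*i+2
		smallest := i
		if l < last && h.less(h.items[l], h.items[smallest]) {
			smallest = l
		}
		if r < last && h.less(h.items[r], h.items[smallest]) {
			smallest = r
		}
		if smallest == i {
			break
		}
		h.items[i], h.items[smallest] = h.items[smallest], h.items[i]
		i = smallest
	}
	return top
}

func main() {
	h := &minHeap[int]{less: func(a, b int) bool { return a < b }}
	for _, x := range []int{3, 1, 2} {
		h.push(x)
	}
	fmt.Println(h.pop(), h.pop(), h.pop()) // 1 2 3
}
```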


pkg/kv/kvclient/kvstreamer/results_buffer.go line 271 at r12 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

How much of a performance hit is it to signal at the end of addLocked every time it's called and remove doneAddingLocked for reduced complexity?

Edit: I see in a later commit doneAddingLocked is also used for budgeting purposes, so getting rid of it may not be optimal anyway.

I actually had the same thought initially - whether it is worth introducing this doneAddingLocked method to be able to "batch" the results returned on a get call (if we signaled on the channel every time in addLocked, in theory get could return one result at a time) - and wanted to prototype the approach without doneAddingLocked. But then I added the memory accounting, and this method is a very convenient place to do that, so I decided it's probably worth keeping.

The semantics of addLocked and doneAddingLocked seem pretty clean, so I ended up not prototyping what the performance implications / mutex contention would be if we did the memory accounting as well as the signaling on the channel on every addLocked call.


pkg/kv/kvclient/kvstreamer/results_buffer.go line 271 at r15 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

Isn't it still advisable to hold the lock while signaling on the channel?

Yeah, we're still holding the lock, and it is checked in accountForOverheadLocked. I removed it here to avoid the duplication but can bring it back if you think it makes things clearer - there should be no performance impact whatsoever from asserting the mutex is held in regular builds.


pkg/kv/kvclient/kvstreamer/results_buffer.go line 297 at r16 (raw file):
You're right that the budget can go into debt (since the results buffer doesn't have a push-back mechanism, and it would be sad - and more difficult - to drop already-completed results). The idea is that the worker coordinator will see that the budget is in debt, so it won't issue any more requests; eventually the caller will consume results via a GetResults call, and the budget should get out of debt once those results are Released. Note that here we account only for the overhead of the results (i.e. the shallow size of a Result object, which is a constant) and not the KV response (which can be arbitrarily large), so it should be relatively unlikely that we get into debt when accounting for the overhead.

But in the previous commit we reuse results, right?

The previous commit reuses the results only in the InOrder mode and only for the return value of get call. In the InOrder mode we need two slices - one is for the buffered results (all results we have received but not returned yet) and another to return on get call (all the results that we can return according to the ordering). The first slice is reused implicitly since we're maintaining a heap over it (meaning that we're removing from the tail and adding to the tail - and then pushing elements down if needed), the second slice is reused explicitly. Initially I forgot to account for the second slice, fixed.

So we also never reduce the capacity of results, which may cause us to stay in debt?

In theory, I guess it's possible, but in reality this should never happen. It could happen if we're using the InOrder mode and the overhead of buffered and resultsScratch is large in comparison to the total budget, so that even when we spill results to disk, the overhead (which is always kept in memory) still keeps the budget in debt. I could only see this happening if the budget's limit is too small.

In the OutOfOrder mode the capacity of results can shrink, since we are not reusing the slice after the get call. (I have thought about trying to reuse it, but it's quite complex given that we don't know when results get Released - we might have a concurrent addLocked call adding a new result into the reused slice before the corresponding result has been processed by the caller.)


pkg/kv/kvclient/kvstreamer/streamer.go line 1372 at r12 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

I'm a bit concerned about having calls to addLocked deep in here when the lock is acquired in a conditional above, and that the conditions being the same aren't immediately obvious. The assertion that the lock is held in addLocked helps, but this seems complicated. Will think about this more, nothing actionable atm.

I agree that the structure of the code doesn't make it clear that the right locks are held, but I'm not sure how to improve it - any suggestions are very welcome. I did add some test assertions though.

@yuzefovich (Member, Author) commented:

Added another commit. At the moment of writing, the first commit is part of a different PR.

@ajwerner (Contributor) left a comment



pkg/kv/kvclient/kvstreamer/streamer.go line 1372 at r12 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I agree that the structure of the code doesn't make it clear that the right locks are held, but I'm not sure how to improve it - any suggestions are very welcome. I did add some test assertions though.

Drive-by while waiting for some code to compile. You've effectively got the same conditional inside this function twice - once at the top, once at the bottom - and an assumption about its value in the middle. This makes me think one way to make it cleaner is to split it into two different functions: perhaps a processEmptySingleRangeResults for the special case, with the condition checked at the top calling down to it, so the logic in the body of the current function can assume it needs to lock and make that clear, etc. (see the sketch below). Maybe there will be a touch of duplication, but it seems worth it.
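
For concreteness, a toy, self-contained sketch of the suggested split; every name here is hypothetical and only the shape matters:

```
package main

import "sync"

// Hypothetical shape of the suggested split; all names are invented and
// only the structure matters: the empty-response special case is dispatched
// at the top, so the main body can take the lock unconditionally.
type batchResponse struct{ results []string }

func (br batchResponse) isEmpty() bool { return len(br.results) == 0 }

type streamer struct {
	mu       sync.Mutex
	buffered []string
}

func (s *streamer) processEmptySingleRangeResponse(br batchResponse) {
	// No results to buffer, so no locking is needed here.
}

func (s *streamer) processSingleRangeResponse(br batchResponse) {
	if br.isEmpty() {
		s.processEmptySingleRangeResponse(br)
		return
	}
	s.mu.Lock() // the body below can now assume the lock is held
	defer s.mu.Unlock()
	s.buffered = append(s.buffered, br.results...)
}

func main() {
	s := &streamer{}
	s.processSingleRangeResponse(batchResponse{results: []string{"r1"}})
	s.processSingleRangeResponse(batchResponse{})
}
```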

@yuzefovich (Member, Author) left a comment



pkg/kv/kvclient/kvstreamer/streamer.go line 1372 at r12 (raw file):

Previously, ajwerner wrote…

Drive-by waiting for some code to compile. You've effectively got the same conditional inside this function twice, once at the top, once at the bottom, and then you've got an assumption about its value in the middle. This makes me think one way to make it cleaner is to split it into two different functions, perhaps processEmptySingleRangeResults and what you've got where you check the condition at the top and call down to the special case and then the logic in the body of the current function can assume you need to lock and make that clear etc. Maybe there will be a touch of duplication, but it seems worth it.

Drive-bys are always welcome!

I read over your suggestion several times, but I'm not sure I follow. Are you suggesting splitting out a single loop over br.Responses into two loops, each in a separate function (one would process responses for which we create results, and another would process responses with the resume spans which we add into the resume request)?

@ajwerner (Contributor) left a comment



pkg/kv/kvclient/kvstreamer/streamer.go line 1372 at r12 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Are you suggesting splitting out a single loop over br.Responses into two loops, each in a separate function (one would process responses for which we create results, and another would process responses with the resume spans which we add into the resume request)?

Yep. See how these two commits make you feel: https://github.com/yuzefovich/cockroach/compare/streamer-optimizations...ajwerner:streamer-optimizations?expand=1

@yuzefovich yuzefovich force-pushed the streamer-optimizations branch 2 times, most recently from d384520 to 03d3286 Compare June 27, 2022 18:05
@yuzefovich (Member, Author) left a comment

I took up Andrew's suggestion and included it in a new commit. I also removed most of the commits from this PR for ease of review. The first commit is still part of another PR and should be ignored here. PTAL.



pkg/kv/kvclient/kvstreamer/streamer.go line 1372 at r12 (raw file):

Previously, ajwerner wrote…

Are you suggesting splitting out a single loop over br.Responses into two loops, each in a separate function (one would process responses for which we create results, and another would process responses with the resume spans which we add into the resume request)?

Yep. See how these two commits make you feel: https://github.com/yuzefovich/cockroach/compare/streamer-optimizations...ajwerner:streamer-optimizations?expand=1

Makes sense, thanks for typing this out. I was a bit concerned about the performance impact of having an additional loop (since on each request we perform pretty expensive type switches and interface dispatching), but the microbenchmarks don't reveal any perf hit, so I think the cleanup offered by this change is worth it.

@yuzefovich yuzefovich changed the title kvstreamer: miscellaneous optimizations and memory accounting kvstreamer: remove temporary allocations for []Result Jun 27, 2022
@yuzefovich yuzefovich force-pushed the streamer-optimizations branch from 03d3286 to ded11a4 Compare June 28, 2022 15:54
@yuzefovich (Member, Author) commented:

Both commits are relevant to this PR.

@michae2 (Collaborator) left a comment

:lgtm:

I left a few thoughts below but they're not important.



pkg/kv/kvclient/kvstreamer/results_buffer.go line 271 at r29 (raw file):

```
func (b *outOfOrderResultsBuffer) doneAddingLocked() {
	b.Mutex.AssertHeld()
	b.signal()
```

nit: Now that these are separate methods, you might consider guarding this signal with `if len(b.results) > 0`.


pkg/kv/kvclient/kvstreamer/streamer.go line 1312 at r29 (raw file):

```
	s.mu.avgResponseEstimator.update(
		fp.memoryFootprintBytes, int64(fp.numGetResults+fp.numScanResults),
	)
```

suggestion: How about doing this at the end of this function, with int64(fp.numGetResults + someOtherCounter) where someOtherCounter is incremented whenever result.scanComplete is true?

Oh, you're trying to avoid holding the streamer mutex for the whole function. 🤔 Well, never mind.

Oh, this value is already in resultsBufferBase.numCompleteResponses. Maybe the avgResponseEstimator should be part of resultsBufferBase?

Anyway, not super important.

Code quote:

```
	// TODO(yuzefovich): some of the responses might be partial, yet the
	// estimator doesn't distinguish the footprint of the full response vs
	// the partial one. Think more about this.
	s.mu.avgResponseEstimator.update(
		fp.memoryFootprintBytes, int64(fp.numGetResults+fp.numScanResults),
	)
```

@michae2 (Collaborator) left a comment



pkg/kv/kvclient/kvstreamer/results_buffer.go line 48 at r29 (raw file):

```
	// moment. The boolean indicates whether all expected Results have been
	// returned. Must be called without holding the budget's mutex.
	get(context.Context) (_ []Result, allComplete bool, _ error)
```

Random thought: if you want to optimize further (and make the streamer even more like io_uring) you could change get to return a pointer to a single Result rather than returning []Result (or turn it into some kind of result-by-result iterator).
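
A rough sketch of what that iterator-shaped alternative might look like; the signature is purely hypothetical (the actual interface returns a []Result):

```
package main

import "context"

// Invented stand-in for kvstreamer.Result.
type Result struct{}

// Hypothetical iterator-shaped variant of get, as floated above: instead of
// materializing a []Result, the buffer hands out one Result at a time.
type resultIterator interface {
	// next returns the next Result; ok=false means none is available yet,
	// and allComplete reports whether every expected Result was returned.
	next(ctx context.Context) (r *Result, allComplete bool, ok bool, err error)
}

func main() { var _ resultIterator }
```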

@yuzefovich yuzefovich force-pushed the streamer-optimizations branch from ded11a4 to 6343df3 Compare June 30, 2022 05:37
@yuzefovich yuzefovich dismissed rharding6373’s stale review June 30, 2022 05:38

Feedback has been addressed.

@yuzefovich (Member, Author) left a comment

TFTRs!

bors r+



pkg/kv/kvclient/kvstreamer/results_buffer.go line 48 at r29 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Random thought: if you want to optimize further (and make the streamer even more like io_uring) you could change get to return a pointer to a single Result rather than returning []Result (or turn it into some kind of result-by-result iterator).

I think I did prototype such an idea at some point (hoping to remove the allocation needed for []Result in the InOrder mode as well as to reuse the results slice in the OutOfOrder mode), but it showed a regression in performance. Added a TODO to try that again.


pkg/kv/kvclient/kvstreamer/results_buffer.go line 271 at r29 (raw file):

Previously, michae2 (Michael Erickson) wrote…

nit: Now that these are separate methods, you might consider guarding this signal with if len(b.results) > 0.

Hm, I think I'll keep it the way it is - the only downside of signaling when there are no results (which shouldn't really happen) is that the streamer's user goroutine will wake up in Streamer.GetResults only to go back to sleep, so there is little harm.


pkg/kv/kvclient/kvstreamer/streamer.go line 1312 at r29 (raw file):

you're trying to avoid holding the streamer mutex for the whole function.

Yeah, I tried to optimize the locking pattern so that we don't hold locks for longer than needed and don't acquire the same lock multiple times inside of the same function.

this value is already in resultsBufferBase.numCompleteResponses.

Not exactly - fp.numGetResults+fp.numScanResults is the number of Result objects created based on a single BatchRequest that targeted a single range, whereas resultsBufferBase.numCompleteResponses tracks the number of "complete" requests (which could span multiple ranges) for the current Enqueue batch of requests. numCompleteResponses is reset across Streamer.Enqueue calls, whereas the avg response size estimator is updated throughout the lifetime of the Streamer. So avgResponseEstimator.numResponses does equal resultsBufferBase.numCompleteResponses, but only before the second call to Enqueue.

Maybe the avgResponseEstimator should be part of resultsBufferBase?

Hm, it's an interesting question. I do see some similarities in these components (e.g. that they handle the results from the BatchResponses), but I think it's cleaner to keep them apart. I think in the future (e.g. when we hook up the Streamer to power regular scans - i.e. TableReaders / cFetchers) the avg response estimator will need to be more sophisticated and it'll probably be better if that complicated logic doesn't make the code around the results buffer more complicated too.

@craig (craig bot, Contributor) commented Jun 30, 2022

Build failed:

@yuzefovich (Member, Author) commented:

Seems like an acceptance flake.

bors r+

@craig (craig bot, Contributor) commented Jun 30, 2022

Build succeeded:

@craig craig bot merged commit ccd6f76 into cockroachdb:master Jun 30, 2022
@yuzefovich yuzefovich deleted the streamer-optimizations branch June 30, 2022 07:35